Please add in appropriate graph titles and axis labels yourself.
You will need to install the packages car and leaps if you have not already.
We will use a subset of the bdims dataset in the openintro package. The subset that we will use is called d (see below).
library(openintro)
dim(bdims)
[1] 507 25
names(bdims)
[1] "bia.di" "bii.di" "bit.di" "che.de" "che.di" "elb.di"
[7] "wri.di" "kne.di" "ank.di" "sho.gi" "che.gi" "wai.gi"
[13] "nav.gi" "hip.gi" "thi.gi" "bic.gi" "for.gi" "kne.gi"
[19] "cal.gi" "ank.gi" "wri.gi" "age" "wgt" "hgt"
[25] "sex"
# Creating dataset for the demo
d = bdims[,c("wgt", "sex", "age", "sho.gi", "che.gi",
"thi.gi", "bic.gi", "for.gi", "kne.gi")]
# See descriptions of the variables of interest
?bdims
str(d)
'data.frame': 507 obs. of 9 variables:
$ wgt : num 65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
$ sex : int 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 21 23 28 23 22 21 26 27 23 21 ...
$ sho.gi: num 106 110 115 104 108 ...
$ che.gi: num 89.5 97 97.5 97 97.5 ...
$ thi.gi: num 51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
$ bic.gi: num 32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
$ for.gi: num 26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
$ kne.gi: num 34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
# sex is stored as an integer but is supposed to be
# a categorical variable; let's change it to
# a factor
d$sex = as.factor(d$sex)
str(d)
'data.frame': 507 obs. of 9 variables:
$ wgt : num 65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
$ sex : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ age : int 21 23 28 23 22 21 26 27 23 21 ...
$ sho.gi: num 106 110 115 104 108 ...
$ che.gi: num 89.5 97 97.5 97 97.5 ...
$ thi.gi: num 51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
$ bic.gi: num 32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
$ for.gi: num 26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
$ kne.gi: num 34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
# The data types of other variables seem correct.
# If needed, we can always modify the data types later.
We want to build a linear model with all, or a subset, of the variables in our dataset to predict the weight of a person.
Use the GGally package to graph the pairwise relationships between the variables, making the plots for males and females separately. Why do we want to make the plots for males and females separately?
d$sex = as.factor(d$sex)
library(ggplot2)
library(GGally)
# Instructors, please explain what the graph and the numbers
# on the graph mean
ggpairs(d[, c(2:9, 1)], aes(colour = sex, alpha = 0.4),
upper = list(continuous = wrap("cor", size = 2.2)),
lower = list(continuous = wrap("points", alpha = 0.3, size=0.1))) +
theme(axis.text.x = element_text(angle = 45))
Answer the following questions with the scatterplot matrix above.
Should sex be included as one of the variables in the model? Yes, since the variable wgt seems to have a different distribution for the two values of sex.
Should we include all of che.gi, sho.gi, bic.gi, and for.gi? Most likely not; they have strong pairwise correlations (all around or above .9). Collinearity between the variables will make the model nearly unidentifiable. (Preceptors: students often do not understand what collinearity means; please explain in simple terms, e.g., with collinear columns there are infinitely many combinations of values for the estimated betas that fit almost equally well, which makes interpreting the betas infeasible.)
Do we expect age to be a better predictor than the other continuous variables in the dataset? Probably not, since the scatterplot of weight vs. age does not look very linear. The correlation between the two variables is also low (as expected from the scatterplot).
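To make the collinearity point above concrete, here is a small simulation (hypothetical data, not from bdims): when a near-copy of a predictor enters the model, the standard error of that predictor's coefficient estimate blows up.

```r
# Hypothetical simulation: x2 is almost an exact copy of x1
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # cor(x1, x2) is very close to 1
y  <- 1 + 2 * x1 + rnorm(n)

se_alone    <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_together <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

c(se_alone = se_alone, se_together = se_together)
# the standard error of beta_hat.x1 inflates dramatically once x2 enters
```

Both fits have nearly the same R-squared; only the individual coefficient estimates become unstable, which is exactly why interpretation breaks down.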
Use the regsubsets() function in the leaps package to perform model selection with the BIC, adjusted R-squared, and Mallow's Cp criteria, searching through all possible subsets of the predictors. (Preceptor: Please explain what regsubsets() does.)
library(leaps)
# library(car)
str(d)
'data.frame': 507 obs. of 9 variables:
$ wgt : num 65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
$ sex : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ age : int 21 23 28 23 22 21 26 27 23 21 ...
$ sho.gi: num 106 110 115 104 108 ...
$ che.gi: num 89.5 97 97.5 97 97.5 ...
$ thi.gi: num 51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
$ bic.gi: num 32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
$ for.gi: num 26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
$ kne.gi: num 34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
# For each given number of predictors, find the best model
g <- regsubsets(wgt~., data=d, really.big=F, method='exhaustive', nvmax=8)
# set really.big=T if your design matrix is very big;
# change the method option to forward selection, backward selection,
# or sequential replacement when the dataset is very big.
plot(summary(g)$bic,
main = 'BIC for best model for each given number of predictors',
xlab = 'Given number of predictors',
ylab = 'BIC')
plot(summary(g)$adjr2, main = 'Adjusted R-square for best model \nfor each given number of predictors', xlab = 'Given number of predictors', ylab = 'Adjusted R-square') # adjusted R^2 chooses a different model
plot(summary(g)$cp, main = 'Mallows Cp for best model for each given number of predictors', xlab = 'Given number of predictors', ylab = 'Mallows Cp')
abline(a = 1, b = 1, lty = "dashed") # reference line Cp = (number of predictors) + 1
Check what variables are included in each “best” model.
summary.g <- summary(g)
# The best model for each number of predictors included in model
as.data.frame(summary.g$outmat)
Another way to find out the best model with 4 predictors and that with 5 predictors:
coef(g, id = 4)
(Intercept) sex1 che.gi thi.gi kne.gi
-83.6121786 5.1971057 0.6746304 0.6395871 1.4059045
coef(g, id = 5)
(Intercept) sex1 sho.gi che.gi thi.gi
-87.4847777 4.2805446 0.1742738 0.5549794 0.6234783
kne.gi
1.3381446
It looks like all the procedures that we used point us toward two possible models (in R’s notation):
Model 1: wgt ~ sex + che.gi + thi.gi + kne.gi
Model 2: wgt ~ sex + sho.gi + che.gi + thi.gi + kne.gi
mod1 = lm(wgt ~ sex + che.gi + thi.gi + kne.gi, data=d)
summary(mod1)
Call:
lm(formula = wgt ~ sex + che.gi + thi.gi + kne.gi, data = d)
Residuals:
Min 1Q Median 3Q Max
-11.345 -2.515 -0.218 2.201 12.138
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -83.61218 2.61177 -32.014 < 2e-16 ***
sex1 5.19711 0.61496 8.451 3.14e-16 ***
che.gi 0.67463 0.03247 20.774 < 2e-16 ***
thi.gi 0.63959 0.05904 10.834 < 2e-16 ***
kne.gi 1.40590 0.09969 14.103 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.728 on 502 degrees of freedom
Multiple R-squared: 0.9226, Adjusted R-squared: 0.922
F-statistic: 1496 on 4 and 502 DF, p-value: < 2.2e-16
mod2 = lm(wgt ~ sex + sho.gi + che.gi + thi.gi + kne.gi, data=d)
summary(mod2)
Call:
lm(formula = wgt ~ sex + sho.gi + che.gi + thi.gi + kne.gi, data = d)
Residuals:
Min 1Q Median 3Q Max
-11.9278 -2.5046 -0.2666 2.0874 12.2488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.48478 2.78306 -31.435 < 2e-16 ***
sex1 4.28054 0.65577 6.527 1.64e-10 ***
sho.gi 0.17427 0.04704 3.705 0.000235 ***
che.gi 0.55498 0.04552 12.193 < 2e-16 ***
thi.gi 0.62348 0.05846 10.664 < 2e-16 ***
kne.gi 1.33814 0.10013 13.364 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.682 on 501 degrees of freedom
Multiple R-squared: 0.9247, Adjusted R-squared: 0.9239
F-statistic: 1230 on 5 and 501 DF, p-value: < 2.2e-16
The function vcov() gives the pairwise covariance between the estimates of two coefficients; e.g., the 3rd row and 4th column of vcov(mod2) gives the covariance of \(\hat{\beta}_{sho.gi}\) and \(\hat{\beta}_{che.gi}\). The numbers on the diagonal of vcov(mod2) give the variances of the \(\beta\) estimates; e.g., the 3rd row and 3rd column of vcov(mod2) gives us the variance of \(\hat{\beta}_{sho.gi}\).
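As a quick sanity check of this fact (sketched here with simulated data, though the same identity holds for mod1 and mod2): the square roots of the diagonal of vcov() reproduce the Std. Error column that summary() reports.

```r
# Simulated sketch: standard errors are the square roots of the
# diagonal of the coefficient covariance matrix
set.seed(42)
x   <- rnorm(100)
y   <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

sqrt(diag(vcov(fit)))                      # SEs of the beta estimates
summary(fit)$coefficients[, "Std. Error"]  # identical values
```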
round(vcov(mod1), 4)
(Intercept) sex1 che.gi thi.gi kne.gi
(Intercept) 6.8213 0.5443 -0.0264 -0.0112 -0.1093
sex1 0.5443 0.3782 -0.0153 0.0208 -0.0135
che.gi -0.0264 -0.0153 0.0011 -0.0008 -0.0005
thi.gi -0.0112 0.0208 -0.0008 0.0035 -0.0034
kne.gi -0.1093 -0.0135 -0.0005 -0.0034 0.0099
round(vcov(mod2), 4)
(Intercept) sex1 sho.gi che.gi thi.gi
(Intercept) 7.7454 0.7895 -0.0492 0.0080 -0.0064
sex1 0.7895 0.4300 -0.0116 -0.0069 0.0214
sho.gi -0.0492 -0.0116 0.0022 -0.0015 -0.0002
che.gi 0.0080 -0.0069 -0.0015 0.0021 -0.0006
thi.gi -0.0064 0.0214 -0.0002 -0.0006 0.0034
kne.gi -0.0875 -0.0087 -0.0009 0.0001 -0.0032
kne.gi
(Intercept) -0.0875
sex1 -0.0087
sho.gi -0.0009
che.gi 0.0001
thi.gi -0.0032
kne.gi 0.0100
Note that the correlation of two variables X and Y is defined as
\[ cor(X, Y) = \frac{cov(X,Y)}{sd(X)sd(Y)} \]
We already saw in part (c) that sho.gi and che.gi are highly positively correlated; as a result, \(\hat{\beta}_{sho.gi}\) and \(\hat{\beta}_{che.gi}\) are highly negatively correlated.
# a correlation(beta_i, beta_j) close to 1 or -1 is bad since it implies
# collinearity between X_i and X_j columns;
# we do not want to see high values for the correlations.
# The correlation between beta_hat.sho.gi and beta_hat.che.gi is
vcov(mod2)["sho.gi", "che.gi"]/sqrt(vcov(mod2)["sho.gi", "sho.gi"])/sqrt(vcov(mod2)["che.gi", "che.gi"])
[1] -0.7095983
# this is high enough to get our attention
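Rather than dividing entries by hand, base R's cov2cor() rescales the whole covariance matrix into a correlation matrix at once; in our setting, round(cov2cor(vcov(mod2)), 2) would show every pairwise correlation of the coefficient estimates. A self-contained sketch with simulated data:

```r
# cov2cor() converts a covariance matrix to a correlation matrix
set.seed(7)
x1  <- rnorm(50)
x2  <- rnorm(50)
y   <- x1 + x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)

R <- cov2cor(vcov(fit))
R["x1", "x2"]  # equals vcov(fit)["x1","x2"] / (sd of each estimate)
```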
Also, since including the extra term sho.gi does not increase adjusted R-squared much, we favor model 1.
plot(mod1, which=1:2)
# which = 1 gives the residuals v.s. fitted values scatterplot
# which = 2 gives the qqplot for the residuals
hist(mod1$residuals, freq=F, breaks=30)
The residuals seem to have a distribution that is close to Normal, which is a good sign. However, the residuals show a slight upward-curving (parabola-like) pattern. Any non-random pattern in the residuals vs. fitted values plot is undesirable; this pattern is not very strong, though, so it is not too bad. We will try to improve on this in part (h).
ggpairs(d[,c('sex', 'che.gi', 'thi.gi','kne.gi', 'wgt')],
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", alpha = 0.3, size=0.1)),
aes(colour = sex, alpha = 0.4))
None of the three x-variables show very different slopes for the two sex groups in the scatterplots of wgt vs. each x-variable. Of the three, thi.gi seems the most likely to have different slopes for females and males. Therefore, we will try adding the interaction term thi.gi:sex.
Add the interaction term thi.gi:sex to your model and fit the model again. Is the model with the interaction term an improvement on the model without? Looking at the relationship between weight and thigh girth on the scatterplot, the model could benefit from including an interaction term. Let's compare the models with and without the interaction term:
mod3 = lm(wgt ~ sex + che.gi + thi.gi + kne.gi + sex:thi.gi, data=d)
# compare R-squared's
summary(mod3)
Call:
lm(formula = wgt ~ sex + che.gi + thi.gi + kne.gi + sex:thi.gi,
data = d)
Residuals:
Min 1Q Median 3Q Max
-11.4106 -2.4585 -0.2282 2.2415 12.0997
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -79.88735 3.19529 -25.002 < 2e-16 ***
sex1 -3.38022 4.30838 -0.785 0.4331
che.gi 0.67014 0.03245 20.650 < 2e-16 ***
thi.gi 0.56871 0.06860 8.290 1.05e-15 ***
kne.gi 1.42619 0.09990 14.277 < 2e-16 ***
sex1:thi.gi 0.15143 0.07529 2.011 0.0448 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.717 on 501 degrees of freedom
Multiple R-squared: 0.9232, Adjusted R-squared: 0.9224
F-statistic: 1205 on 5 and 501 DF, p-value: < 2.2e-16
summary(mod1)
Call:
lm(formula = wgt ~ sex + che.gi + thi.gi + kne.gi, data = d)
Residuals:
Min 1Q Median 3Q Max
-11.345 -2.515 -0.218 2.201 12.138
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -83.61218 2.61177 -32.014 < 2e-16 ***
sex1 5.19711 0.61496 8.451 3.14e-16 ***
che.gi 0.67463 0.03247 20.774 < 2e-16 ***
thi.gi 0.63959 0.05904 10.834 < 2e-16 ***
kne.gi 1.40590 0.09969 14.103 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.728 on 502 degrees of freedom
Multiple R-squared: 0.9226, Adjusted R-squared: 0.922
F-statistic: 1496 on 4 and 502 DF, p-value: < 2.2e-16
Adjusted R-squared values are about the same for the two models.
# look at p-value which is on the borderline of .05
anova(mod1, mod3)
The p-value is right on the borderline of 5%, so no firm conclusion can be drawn about whether the coefficient for the interaction is zero.
Let’s compare the BIC values:
# Compare BICs
extractAIC(mod1, k=log(dim(d)[1]))
[1] 5.000 1360.404
extractAIC(mod3, k=log(dim(d)[1])) # This actually increases BIC
[1] 6.000 1362.556
Adding the interaction term actually increases the BIC value; thus, for simplicity, we favor model 1.
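Base R's BIC() would give the same ranking here: it differs from extractAIC(fit, k = log(n)) only by an additive constant that is identical for every model fit to the same data, so model comparisons are unchanged. A sketch with simulated data:

```r
# BIC() and extractAIC(fit, k = log(n)) rank models identically;
# their values differ only by a constant depending on n
set.seed(3)
x <- rnorm(80); z <- rnorm(80)
y <- 1 + x + rnorm(80)

fitA <- lm(y ~ x)      # analogous to mod1
fitB <- lm(y ~ x + z)  # analogous to mod3 (one extra term)

BIC(fitA)
BIC(fitB)
# the gap BIC(fit) - extractAIC(fit, k = log(80))[2] is the same for both fits
```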
head(model.matrix(mod1))
(Intercept) sex1 che.gi thi.gi kne.gi
1 1 1 89.5 51.5 34.5
2 1 1 97.0 51.5 36.5
3 1 1 97.5 57.3 37.0
4 1 1 97.0 53.0 37.0
5 1 1 97.5 55.4 37.7
6 1 1 99.9 57.5 36.6
colnames(model.matrix(mod1))
[1] "(Intercept)" "sex1" "che.gi" "thi.gi"
[5] "kne.gi"
The mathematical model for model 1 is
\[ Weight = \beta_0 + \beta_{1_{male}}1_{male} + \beta_{che.gi}che.gi +\beta_{thi.gi}thi.gi +\beta_{kne.gi}kne.gi + error \]
Please fill in the missing graph titles and labels yourself.
Now it is your turn! Use the dataset below and build a linear model with the variables in the dataset to predict the height of a person. Choose your champion model based on the BIC, adjusted \(R^2\), and Mallow's Cp values (we will not use cross-validation here). Look at the residual plots of your champion model to see whether any model assumptions are violated. Consider adding possible interaction terms to improve your model.
Note: You can turn off the warnings produced by ggpairs() by setting message = F, warning = F for the code chunk.
b = bdims[,c("hgt", "sex", "age", "bii.di", "bit.di", "che.de", "kne.di", "ank.di",
"wai.gi", "nav.gi", "hip.gi", "thi.gi")]
dim(b)
[1] 507 12
str(b)
'data.frame': 507 obs. of 12 variables:
$ hgt : num 174 175 194 186 187 ...
$ sex : int 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 21 23 28 23 22 21 26 27 23 21 ...
$ bii.di: num 26 28.5 28.2 29.9 29.9 27 30 29.8 26.5 28 ...
$ bit.di: num 31.5 33.5 33.3 34 34 31.5 34 33.2 32.1 34 ...
$ che.de: num 17.7 16.9 20.9 18.4 21.5 19.6 21.9 21.8 15.5 22.5 ...
$ kne.di: num 18.8 20.6 19.7 20.9 20.7 18.8 20.8 21 18.9 21.1 ...
$ ank.di: num 14.1 15.1 14.1 15 14.9 13.9 15.6 14.6 13.2 15 ...
$ wai.gi: num 71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
$ nav.gi: num 74.5 86.5 82.9 78.8 82.5 80.1 84 80.5 69 81.5 ...
$ hip.gi: num 93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
$ thi.gi: num 51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
b$sex = as.factor(b$sex)
# There are too many variables for plotting them all
# on the same graph, so we will make two graphs
ggpairs(b[,c(2, 3:7, 1)], aes(colour = sex, alpha = 0.4),
upper = list(continuous = wrap("cor", size = 2.4)),
lower = list(continuous = wrap("points", alpha = 0.3, size=0.1))) +
theme(axis.text.x = element_text(angle = 45, size = 6))
ggpairs(b[,c(2, 8:12, 1)], aes(colour = sex, alpha = 0.4),
upper = list(continuous = wrap("cor", size = 2.4)),
lower = list(continuous = wrap("points", alpha = 0.3, size=0.1))) +
theme(axis.text.x = element_text(angle = 45, size = 6))
g <- regsubsets(hgt~., data=b, really.big=F, method='exhaustive', nvmax=12)
plot(summary(g)$bic,
main = 'BIC for best model for each given number of predictors',
xlab = 'Given number of predictors',
ylab = 'BIC')
plot(summary(g)$adjr2, main = 'Adjusted R-square for best model \nfor each given number of predictors', xlab = 'Given number of predictors', ylab = 'Adjusted R-square') # adjusted R^2 chooses a different model
plot(summary(g)$cp, main = 'Mallows Cp for best model for each given number of predictors', xlab = 'Given number of predictors', ylab = 'Mallows Cp')
abline(a = 1, b = 1, lty = "dashed") # reference line Cp = (number of predictors) + 1
coef(g, id = 8)
(Intercept) sex1 age bii.di bit.di
100.3259546 8.2318518 -0.1607032 0.6926682 1.0475037
che.de ank.di wai.gi thi.gi
0.5658327 1.9707962 -0.1102385 -0.1905396
coef(g, id = 9)
(Intercept) sex1 age bii.di bit.di
98.9916772 8.8607781 -0.1649853 0.6965493 0.8197194
che.de ank.di wai.gi hip.gi thi.gi
0.5696904 1.9328071 -0.1763574 0.2492287 -0.3702857
mod1 = lm(hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + wai.gi + thi.gi, data=b)
summary(mod1)
Call:
lm(formula = hgt ~ sex + age + bii.di + bit.di + che.de + ank.di +
wai.gi + thi.gi, data = b)
Residuals:
Min 1Q Median 3Q Max
-16.1582 -4.0421 -0.0961 4.0007 15.5024
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 100.32595 4.75057 21.119 < 2e-16 ***
sex1 8.23185 1.00712 8.174 2.49e-15 ***
age -0.16070 0.03125 -5.142 3.92e-07 ***
bii.di 0.69267 0.15967 4.338 1.74e-05 ***
bit.di 1.04750 0.19986 5.241 2.36e-07 ***
che.de 0.56583 0.17507 3.232 0.00131 **
ank.di 1.97080 0.32074 6.145 1.64e-09 ***
wai.gi -0.11024 0.05175 -2.130 0.03364 *
thi.gi -0.19054 0.08670 -2.198 0.02843 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.678 on 498 degrees of freedom
Multiple R-squared: 0.6415, Adjusted R-squared: 0.6357
F-statistic: 111.4 on 8 and 498 DF, p-value: < 2.2e-16
mod2 = lm(hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + wai.gi + hip.gi + thi.gi, data=b)
summary(mod2)
Call:
lm(formula = hgt ~ sex + age + bii.di + bit.di + che.de + ank.di +
wai.gi + hip.gi + thi.gi, data = b)
Residuals:
Min 1Q Median 3Q Max
-15.6949 -4.2401 -0.0461 4.0656 16.3823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98.99168 4.77182 20.745 < 2e-16 ***
sex1 8.86078 1.04374 8.489 2.41e-16 ***
age -0.16499 0.03120 -5.288 1.85e-07 ***
bii.di 0.69655 0.15908 4.379 1.46e-05 ***
bit.di 0.81972 0.22472 3.648 0.000292 ***
che.de 0.56969 0.17442 3.266 0.001165 **
ank.di 1.93281 0.32000 6.040 3.02e-09 ***
wai.gi -0.17636 0.05977 -2.951 0.003320 **
hip.gi 0.24923 0.11399 2.186 0.029256 *
thi.gi -0.37029 0.11924 -3.105 0.002010 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.656 on 497 degrees of freedom
Multiple R-squared: 0.6449, Adjusted R-squared: 0.6385
F-statistic: 100.3 on 9 and 497 DF, p-value: < 2.2e-16
round(vcov(mod1), 4)
(Intercept) sex1 age bii.di bit.di
(Intercept) 22.5679 0.7052 -0.0018 -0.0640 -0.3125
sex1 0.7052 1.0143 0.0100 0.0313 -0.0073
age -0.0018 0.0100 0.0010 -0.0002 -0.0008
bii.di -0.0640 0.0313 -0.0002 0.0255 -0.0155
bit.di -0.3125 -0.0073 -0.0008 -0.0155 0.0399
che.de -0.0838 -0.0310 -0.0006 -0.0005 0.0014
ank.di -0.5838 -0.1555 -0.0008 -0.0066 -0.0118
wai.gi 0.0502 -0.0272 -0.0005 -0.0010 -0.0009
thi.gi -0.0910 0.0476 0.0011 0.0003 -0.0053
che.de ank.di wai.gi thi.gi
(Intercept) -0.0838 -0.5838 0.0502 -0.0910
sex1 -0.0310 -0.1555 -0.0272 0.0476
age -0.0006 -0.0008 -0.0005 0.0011
bii.di -0.0005 -0.0066 -0.0010 0.0003
bit.di 0.0014 -0.0118 -0.0009 -0.0053
che.de 0.0307 -0.0049 -0.0038 -0.0025
ank.di -0.0049 0.1029 0.0007 -0.0025
wai.gi -0.0038 0.0007 0.0027 -0.0019
thi.gi -0.0025 -0.0025 -0.0019 0.0075
round(vcov(mod2), 4)
(Intercept) sex1 age bii.di bit.di
(Intercept) 22.7703 0.5244 -0.0006 -0.0646 -0.2466
sex1 0.5244 1.0894 0.0094 0.0316 -0.0372
age -0.0006 0.0094 0.0010 -0.0002 -0.0006
bii.di -0.0646 0.0316 -0.0002 0.0253 -0.0155
bit.di -0.2466 -0.0372 -0.0006 -0.0155 0.0505
che.de -0.0842 -0.0302 -0.0006 -0.0005 0.0012
ank.di -0.5688 -0.1594 -0.0008 -0.0066 -0.0099
wai.gi 0.0683 -0.0357 -0.0005 -0.0010 0.0023
hip.gi -0.0696 0.0328 -0.0002 0.0002 -0.0119
thi.gi -0.0401 0.0236 0.0012 0.0001 0.0033
che.de ank.di wai.gi hip.gi thi.gi
(Intercept) -0.0842 -0.5688 0.0683 -0.0696 -0.0401
sex1 -0.0302 -0.1594 -0.0357 0.0328 0.0236
age -0.0006 -0.0008 -0.0005 -0.0002 0.0012
bii.di -0.0005 -0.0066 -0.0010 0.0002 0.0001
bit.di 0.0012 -0.0099 0.0023 -0.0119 0.0033
che.de 0.0304 -0.0049 -0.0038 0.0002 -0.0026
ank.di -0.0049 0.1024 0.0012 -0.0020 -0.0011
wai.gi -0.0038 0.0012 0.0036 -0.0034 0.0006
hip.gi 0.0002 -0.0020 -0.0034 0.0130 -0.0094
thi.gi -0.0026 -0.0011 0.0006 -0.0094 0.0142
vcov(mod2)["hip.gi", "thi.gi"]/sqrt(vcov(mod2)["hip.gi", "hip.gi"])/sqrt(vcov(mod2)["thi.gi", "thi.gi"])
[1] -0.68945
# -0.68945 is high enough to attract attention
plot(mod1, which=1:2)
hist(mod1$residuals, freq=F, breaks=30)
mod3 = lm(hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + wai.gi + thi.gi + sex:thi.gi, data=b)